Abstract



In this article, we derive concentration inequalities for the cross-validation estimate of the
generalization error for empirical risk minimizers. In the general setting, we prove sanity-check
bounds in the spirit of Kearns et al. (1999): "bounds showing that the worst-case error of this
estimate is not much worse than that of the training error estimate". General loss functions and
classes of predictors with finite VC-dimension are considered. We closely follow the formalism
introduced by Dudoit et al. (2003) to cover a large variety of cross-validation procedures including
leave-one-out cross-validation, $k$-fold cross-validation, hold-out cross-validation (or split
sample), and leave-$\upsilon$-out cross-validation.



In particular, we focus on proving the consistency of the various cross-validation procedures. We 
point out the interest of each cross-validation procedure in terms of rate of convergence. An 
estimation curve with transition phases depending on the cross-validation procedure and not only 
on the percentage of observations in the test sample gives a simple rule on how to choose the 
cross-validation. An interesting consequence is that the size of the test sample is not required to 
grow to infinity for the consistency of the cross-validation procedure. 



Keywords: Cross-validation, generalization error, concentration inequality, optimal
splitting, resampling.

1. Introduction and motivation



Pattern recognition (or classification or discrimination) is about predicting the unknown nature of
an observation: an observation is a collection of numerical measurements, represented by a vector
$x$ belonging to some measurable space $\mathcal{X}$. The unknown nature of the observation is
denoted by $y$, belonging to a measurable space $\mathcal{Y}$. In pattern recognition, the goal is to
create a measurable map $\phi : \mathcal{X} \to \mathcal{Y}$; $\phi(x)$ represents one's prediction
of $y$ given $x$. The error of a prediction $\phi(x)$ when the true value is $y$ is measured by
$L(y, \phi(x))$, where the loss function is $L : \mathcal{Y}^2 \to \mathbb{R}_+$. For simplicity, we
suppose $L \le 1$. In a probabilistic setting, the distribution $P$ of the random variable
$(X, Y) \in \mathcal{X} \times \mathcal{Y}$ describes the probability of encountering a particular
pair in practice. The performance of $\phi$, that is how well the predictor can predict future data,
is measured by the risk $R(\phi) := E_{(X,Y)} L(Y, \phi(X))$. In practice, we have access to $n$
independent, identically distributed (i.i.d.) random pairs $(X_i, Y_i)_{1 \le i \le n}$ sharing the
same distribution as $(X, Y)$, called the learning sample and denoted $\mathcal{D}_n$. A learning
algorithm $\Phi$ is trained on the basis of $\mathcal{D}_n$. Thus, $\Phi$ is a measurable map from
$\mathcal{X} \times \cup_n (\mathcal{X} \times \mathcal{Y})^n$ to $\mathcal{Y}$, and $Y$ is predicted
by $\Phi(X, \mathcal{D}_n)$. The performance of $\Phi(\cdot, \mathcal{D}_n)$ is measured by the
conditional risk, called the generalization error and denoted by
$\widetilde{R}_n := E_{(X,Y)}[L(Y, \Phi(X, \mathcal{D}_n)) \mid \mathcal{D}_n]$, with $(X, Y) \sim P$
independent of $\mathcal{D}_n$, and with the following equivalent notation for the conditional
expectation of $h(X, Y)$ given $Y$: $E_X h(X, Y)$. In the following, if there is no ambiguity, we
will also allow the notation $\phi(X, \mathcal{D}_n)$ instead of $\Phi(X, \mathcal{D}_n)$. Notice
that $\widetilde{R}_n$ is a random variable measurable with respect to $\mathcal{D}_n$.

An important question is: the distribution $P$ of the generating process being unknown, can we
estimate how good a predictor trained on a learning sample of size $n$ is? In other words, can we
estimate the generalization error $\widetilde{R}_n$? This fundamental statistical problem is referred
to as the "choice and assessment of statistical predictions" (Stone (1974)). Many estimates have been
proposed, among them the resubstitution estimate (or training estimate). The predictor is trained
using the entire learning sample, and an estimate of the prediction error is obtained by running the
same learning sample through the predictor and comparing predicted and actual responses. Thus, the
resubstitution estimate $\widehat{R}_n := \frac{1}{n}\sum_{i=1}^{n} L(Y_i, \Phi(X_i, \mathcal{D}_n))$
can severely underestimate the generalization error. It can even drop to zero for some learning
algorithms even though the generalization error is nonzero (for example, the 1-nearest neighbor).
The difficulty arises from the fact that the learning sample is used both for training and testing.
In order to get rid of this downward bias, estimates of the generalization error based on sample
reuse have been favored among practitioners. Quoting Hastie et al. (2001): "Probably the simplest and
most widely used method for estimating prediction error is cross-validation." However, the role of
the cross-validation estimator, denoted by $\widehat{R}_{CV}$, is far from being well understood in a
general setting. In particular, the following problems remain only partially solved: "Is
$\widehat{R}_{CV}$ a good estimator of the generalization error?", "How should one choose $k$ in
$k$-fold cross-validation?" or "Does cross-validation outperform the resubstitution error?". The
purpose of this paper is to give a partial answer to the first two questions.

We introduce our main result for symmetric cross-validation procedures. We divide the learning
sample into two samples: the training sample and the test sample, to be defined below. We denote
by $p_n$ the percentage of elements in the test sample, such that $np_n$ is an integer. For empirical
risk minimizers over a class of predictors with finite VC-dimension $V_{\mathcal{C}}$, to be defined
below, we have the following concentration inequality, for all $\varepsilon > 0$:

$$\Pr(|\widehat{R}_{CV} - \widetilde{R}_n| \ge \varepsilon) \le B(n,p_n,\varepsilon) + V(n,p_n,\varepsilon),$$

with

• $B(n,p_n,\varepsilon) = 5\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$

• $V(n,p_n,\varepsilon) = \min\left(\exp\left(-\frac{2np_n\varepsilon^2}{25}\right), \frac{16}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-p_n)+1)+4)}{n(1-p_n)}}\right)$

The term $B(n,p_n,\varepsilon)$ is a Vapnik-Chervonenkis-type bound controlled by the size of the
training sample $n(1-p_n)$, whereas the term $V(n,p_n,\varepsilon)$ is the minimum between a
Hoeffding-type term controlled by the size of the test sample $np_n$ and a polynomial term controlled
by the size of the training sample. As the percentage of observations in the test sample $p_n$
increases, the $V(n,p_n,\varepsilon)$ term decreases but the $B(n,p_n,\varepsilon)$ term increases.
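To get a feeling for this trade-off, the two terms can be evaluated numerically. The following is a minimal sketch, assuming the constants of the main inequality as stated in this introduction (5, 64, 25, 16); the function names `log_b_term`, `v_term` and `cv_deviation_bound` are hypothetical and chosen for illustration only. The term $B$ is computed in log-space because the polynomial factor overflows quickly.

```python
import math


def log_b_term(n, p, eps, vc_dim):
    """Logarithm of B(n, p, eps); working in log-space avoids overflow."""
    n_train = n * (1.0 - p)
    return (math.log(5.0)
            + (4.0 * vc_dim / (1.0 - p)) * math.log(2.0 * n_train + 1.0)
            - n * eps ** 2 / 64.0)


def v_term(n, p, eps, vc_dim):
    """V(n, p, eps): minimum of the Hoeffding-type and polynomial terms."""
    n_train = n * (1.0 - p)
    hoeffding = math.exp(-2.0 * n * p * eps ** 2 / 25.0)
    polynomial = (16.0 / eps) * math.sqrt(
        vc_dim * (math.log(2.0 * n_train + 1.0) + 4.0) / n_train)
    return min(hoeffding, polynomial)


def cv_deviation_bound(n, p, eps, vc_dim):
    """Upper bound on Pr(|R_CV - R_n| >= eps), capped at 1."""
    log_b = log_b_term(n, p, eps, vc_dim)
    if log_b >= 0.0:
        return 1.0  # B alone already exceeds 1: the bound is vacuous
    return min(1.0, math.exp(log_b) + v_term(n, p, eps, vc_dim))
```

With these constants the bound is only informative for large $n$: for moderate sample sizes `cv_deviation_bound` returns the trivial value 1, which is in line with the "sanity-check" nature of the result.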

The difference from the previous results on the estimation of $\widetilde{R}_n$ is the following:

• our bounds for intensive cross-validation procedures (i.e. $k$-fold cross-validation or
leave-$\upsilon$-out cross-validation) are not worse than those for hold-out cross-validation.

• our inequalities depend not only on the percentage of observations in the test sample $p_n$
but also on the precise type of cross-validation procedure: this is why we can discriminate
between $k$-fold cross-validation and hold-out cross-validation even if $p_n$ is the same.

• we show that the size of the test sample does not need to grow to infinity for the cross-validation
procedure to be consistent for the estimation of the generalization error.



Using these probability bounds, we can then deduce that the expectation of the difference between
the generalization error and the cross-validation estimate,
$E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n|$, is of order
$O_n\left(\sqrt{V_{\mathcal{C}}\ln(n(1-p_n))/(n(1-p_n))} + \sqrt{1/(np_n)}\right)$. As far as
$E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n|$ is concerned, we can define a splitting rule:
the percentage of elements $p_n$ in the test sample should be proportional to
$\frac{1}{1+V_{\mathcal{C}}^{1/3}}$, i.e. the larger the class of predictors is, the smaller the test
sample in the cross-validation should be.
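A splitting rule of this kind can be motivated by a short heuristic computation. The sketch below is an illustration under simplifying assumptions, not the paper's own derivation: the logarithmic factor is ignored, $a$ stands for $\sqrt{V_{\mathcal{C}}}$ and $p$ for $p_n$, and we minimise the resulting two-term rate over $p \in (0,1)$.

```latex
% Heuristic: minimise f(p) = a / sqrt(n(1-p)) + 1 / sqrt(np), with a = sqrt(V_C).
\begin{align*}
f(p) &= \frac{a}{\sqrt{n(1-p)}} + \frac{1}{\sqrt{np}}, \\
f'(p) &= \frac{1}{2\sqrt{n}}\left(\frac{a}{(1-p)^{3/2}} - \frac{1}{p^{3/2}}\right) = 0
  \iff \left(\frac{p}{1-p}\right)^{3/2} = \frac{1}{a}
  \iff \frac{p}{1-p} = V_{\mathcal{C}}^{-1/3}, \\
p &= \frac{1}{1 + V_{\mathcal{C}}^{1/3}}.
\end{align*}
```

Since $f$ is convex on $(0,1)$, this stationary point is the minimiser; for instance $V_{\mathcal{C}} = 8$ gives $p = 1/3$ and $V_{\mathcal{C}} = 27$ gives $p = 1/4$.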

The paper is organized as follows. In the next section, we give a short review of the literature: we
detail the main cross-validation procedures and summarize the previous results on the estimation of
the generalization error. In Section 3, we introduce the main notations and definitions. Finally, in
Section 4, we introduce our results in terms of concentration inequalities. In companion papers, we
will show that in some cases the cross-validation estimate can outperform the training estimate, and
prove that cross-validation can work with predictors of infinite VC-dimension.



2. Short review of the literature on cross-validation



The cross-validation estimator $\widehat{R}_{CV}$ includes leave-one-out cross-validation, $k$-fold
cross-validation, hold-out cross-validation (or split sample), and leave-$\upsilon$-out
cross-validation (or Monte Carlo cross-validation or bootstrap cross-validation). In leave-one-out
cross-validation, a single sample of size $n$ is used. Each member of the sample in turn is removed,
the full modeling method is applied to the remaining $n-1$ members, and the fitted model is applied
to the held-back member. An early application of this approach to classification is that of
Lachenbruch et al. (1968). Allen (1968) gave perhaps the first application in multiple regression,
and Geisser (1975) sketches other applications. However, this special form of cross-validation has
well-known limitations, both theoretical and practical, and a number of authors have considered more
general multifold cross-validation procedures (Breiman et al. (1984); Breiman et al. (1992); Burman
(1989); Devroye et al. (1996); Geisser (1975); Gyorfi et al. (2002); McCarthy (1976); Picard et al.
(1984); Ripley (1996); Shao (1993); Zhang (1993)). The $k$-fold procedure divides the learning sample
into $k$ equally sized folds. Then, it produces a predictor by training on $k-1$ folds and testing on
the remaining fold. This is repeated for each fold, and the observed errors are averaged to form the
$k$-fold estimate. Leave-$\upsilon$-out cross-validation is a more elaborate and expensive version of
cross-validation that involves leaving out all possible subsamples of $\upsilon$ cases. In the
split-sample method or hold-out, only a single subsample (the training sample) is used to estimate
the generalization error, instead of $k$ different subsamples; i.e., there is no crossing.
Intuitively, there is a trade-off between bias and variance in cross-validation procedures.
Typically, we expect the leave-one-out cross-validation to have a low bias (the generalization error
of a predictor trained on $n-1$ pairs should be close to the generalization error of a predictor
trained on the $n$ pairs) but a high variance. Leave-one-out cross-validation often works well for
estimating the generalization error for continuous loss functions such as the squared loss, but it
may perform poorly for discontinuous loss functions such as the indicator loss. On the contrary,
$k$-fold cross-validation or leave-$\upsilon$-out cross-validation are expected to have a higher bias
but a smaller variance due to resampling.



With the exception of Burman (1989), theoretical investigations of multifold cross-validation
procedures have first concentrated on linear models (Li (1987); Shao (1993); Zhang (1993)). Results of
Devroye et al. (1996) and Gyorfi et al. (2002) are discussed in Section 3. The first finite sample
results are due to Devroye et al. (1979) and concern $k$-local rule algorithms under leave-one-out
and hold-out cross-validation. More recently, Holden (1996a,b) derived finite sample results for the
hold-out, $k$-fold and leave-one-out cross-validations for finite VC algorithms in the realisable
case (the generalization error is zero). But the bounds for $k$-fold cross-validation are $k$ times
worse than for hold-out cross-validation. Blum et al. (1999) have emphasized when $k$-fold can
outperform hold-out cross-validation in a particular case of $k$-fold predictor. Kearns et al. (1999)
have extended such results to the case of stable algorithms for the leave-one-out cross-validation
procedure. Kearns et al. (1995) also derived results for hold-out cross-validation for VC algorithms
without the realisable assumption. However, the bounds obtained are "sanity-check bounds" in the
sense that they are not better than classical Vapnik-Chervonenkis bounds. Van Der Laan et al. (2004)
derived finite sample results for the distance between the cross-validation estimate and a special
benchmark, and proved asymptotic results for the relation between the cross-validation risk and the
generalization error. To our knowledge, bounds for intensive cross-validation procedures are missing.
This might be due to the lack of independence between the crossing terms of the cross-validated
estimate (Kearns et al. (1999)).



3. Notations and definitions 

We introduce here useful definitions to define the various cross-validation procedures. First, we
define binary vectors, i.e. $V_n = (V_{n,i})_{1 \le i \le n}$ is a vector of size $n$ such that for
all $i$, $V_{n,i} \in \{0, 1\}$ and $\sum_i V_{n,i} \ne 0$. Consequently, knowing the binary vector,
we can define the subsample associated with it:
$\mathcal{D}_{V_n} := \{(X_i, Y_i) \in \mathcal{D}_n \mid V_{n,i} = 1\}$. The weighted empirical error
of $\phi$ is denoted by $\widehat{R}_{V_n}(\phi)$ and defined by:

$$\widehat{R}_{V_n}(\phi) := \frac{1}{\sum_{i=1}^{n} V_{n,i}} \sum_{i=1}^{n} V_{n,i}\, L(Y_i, \phi(X_i)).$$

For $\widehat{R}_{1_n}$, with $1_n$ the binary vector of size $n$ with 1 at every coordinate, we will
use the simpler notation $\widehat{R}_n$. For a predictor trained on a subsample, we define:

$$\phi_{V_n}(\cdot) := \Phi(\cdot, \mathcal{D}_{V_n}).$$

With the previous notations, notice that the predictor trained on the learning sample
$\Phi(\cdot, \mathcal{D}_n)$ can be denoted by $\phi_{1_n}(\cdot)$. We will allow the simpler notation
$\phi_n(\cdot)$. The learning sample is divided into two disjoint samples: the training sample of size
$n(1-p_n)$ and the test sample of size $np_n$, where $p_n$ is the percentage of elements in the test
sample. To represent the training sample, we define a random binary vector $V_n^{tr}$ of size $n$
independent of $\mathcal{D}_n$. $V_n^{tr}$ is called the training vector. We define the test vector by
$V_n^{ts} := 1_n - V_n^{tr}$ to represent the test sample.

The distribution of $V_n^{tr}$ characterizes all the cross-validation procedures described in the
previous section. Using our notations, we can now define the cross-validation estimator.

Definition 3.1 (Cross-validation estimator) With the previous notations, the generalized
cross-validation error of $\phi_n$, denoted by $\widehat{R}_{CV}$, is defined by the conditional
expectation of $\widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}})$ with respect to the random vector $V_n^{tr}$
given $\mathcal{D}_n$:

$$\widehat{R}_{CV} := E_{V_n^{tr}}\, \widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}).$$

We will give here some examples of distributions of $V_n^{tr}$ to show that we retrieve the
cross-validation procedures described previously. Suppose $n/k$ is an integer. The $k$-fold procedure
divides the data into $k$ equally sized folds. It then produces a predictor by training on $k-1$
folds and testing on the remaining fold. This is repeated for each fold, and the observed errors are
averaged to form the $k$-fold estimate.

Example 3.1 ($k$-fold cross-validation)

$$\Pr\Bigl(V_n^{tr} = (\underbrace{0,\dots,0}_{n/k \text{ observations}}, \underbrace{1,\dots,1}_{n(1-1/k) \text{ observations}})\Bigr) = \frac{1}{k},$$

$$\Pr\Bigl(V_n^{tr} = (\underbrace{1,\dots,1}_{n/k \text{ observations}}, \underbrace{0,\dots,0}_{n/k \text{ observations}}, \underbrace{1,\dots,1}_{n(1-2/k) \text{ observations}})\Bigr) = \frac{1}{k},$$

$$\dots$$

$$\Pr\Bigl(V_n^{tr} = (\underbrace{1,\dots,1}_{n(1-1/k) \text{ observations}}, \underbrace{0,\dots,0}_{n/k \text{ observations}})\Bigr) = \frac{1}{k}.$$

We provide another popular example: the leave-one-out cross-validation. In leave-one-out
cross-validation, a single sample of size $n$ is used. Each member of the sample in turn is removed,
the full modeling method is applied to the remaining $n-1$ members, and the fitted model is applied
to the held-back member.

Example 3.2 (leave-one-out cross-validation)

$$\Pr(V_n^{tr} = (0,1,\dots,1)) = \frac{1}{n},$$

$$\Pr(V_n^{tr} = (1,0,1,\dots,1)) = \frac{1}{n},$$

$$\dots$$

$$\Pr(V_n^{tr} = (1,\dots,1,0)) = \frac{1}{n}.$$
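The binary-vector formalism translates directly into code. The sketch below is illustrative only: it enumerates the equiprobable training vectors of the $k$-fold and leave-one-out procedures and averages the test errors over them, which is the conditional expectation defining the cross-validation estimator. The helper names (`kfold_training_vectors`, `cv_estimate`, `fit_mean`, `squared_loss`) and the toy "mean" learner are assumptions made for the example, not objects defined in the paper.

```python
def kfold_training_vectors(n, k):
    """Training vectors of k-fold CV: 0 marks the held-out fold, 1 the rest."""
    if n % k != 0:
        raise ValueError("n/k must be an integer")
    fold = n // k
    vectors = []
    for j in range(k):
        v = [1] * n
        for i in range(j * fold, (j + 1) * fold):
            v[i] = 0
        vectors.append(tuple(v))
    return vectors  # each vector has probability 1/k


def leave_one_out_training_vectors(n):
    """Leave-one-out is the k-fold procedure with k = n."""
    return kfold_training_vectors(n, n)


def cv_estimate(xs, ys, training_vectors, fit, loss):
    """Average, over equiprobable training vectors, of the test-sample error."""
    total = 0.0
    for v_tr in training_vectors:
        train = [(x, y) for x, y, v in zip(xs, ys, v_tr) if v == 1]
        test = [(x, y) for x, y, v in zip(xs, ys, v_tr) if v == 0]
        predictor = fit(train)
        total += sum(loss(y, predictor(x)) for x, y in test) / len(test)
    return total / len(training_vectors)


def fit_mean(train):
    """Toy learner (hypothetical): constant predictor equal to the mean response."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


def squared_loss(y, prediction):
    """Squared loss; bounded by 1 when responses and predictions lie in [0, 1]."""
    return (y - prediction) ** 2
```

For example, `cv_estimate(xs, ys, leave_one_out_training_vectors(len(ys)), fit_mean, squared_loss)` computes the leave-one-out estimate for the toy learner; hold-out would correspond to a list containing a single training vector.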

We denote by $R_{opt}$ the minimal generalization error attained among the class of predictors
$\mathcal{C}$, $R_{opt} = \inf_{\phi \in \mathcal{C}} R(\phi)$. In the sequel, we suppose that
$\phi_n$ is an empirical risk minimizer over the class $\mathcal{C}$. For simplicity, we suppose the
infimum is attained, i.e. $\phi_n = \arg\min_{\phi \in \mathcal{C}} \widehat{R}_n(\phi)$. Notice that
$R_{opt}$ is a parameter of the unknown distribution $P_{(X,Y)}$ whereas $\widetilde{R}_n$ is a
random variable.

Finally, recall the following definitions.

Definition 3.2 (Shatter coefficients) Let $\mathcal{A}$ be a collection of measurable sets. For
$(z_1,\dots,z_n) \in \{\mathbb{R}^d\}^n$, let $N_{\mathcal{A}}(z_1,\dots,z_n)$ be the number of
different sets in

$$\{\{z_1,\dots,z_n\} \cap A \;;\; A \in \mathcal{A}\}.$$

The $n$-shatter coefficient of $\mathcal{A}$ is

$$S(n,\mathcal{A}) = \max_{(z_1,\dots,z_n) \in \{\mathbb{R}^d\}^n} N_{\mathcal{A}}(z_1,\dots,z_n).$$

That is, the shatter coefficient is the maximal number of different subsets of $n$ points that can be
picked out by the class of sets $\mathcal{A}$.

Definition 3.3 (VC dimension) Let $\mathcal{A}$ be a collection of sets with $|\mathcal{A}| \ge 2$.
The largest integer $k \ge 1$ for which $S(k,\mathcal{A}) = 2^k$ is denoted by $V_{\mathcal{A}}$, and
it is called the Vapnik-Chervonenkis dimension (or VC dimension) of the class $\mathcal{A}$. If
$S(n,\mathcal{A}) = 2^n$ for all $n$, then by definition $V_{\mathcal{A}} = \infty$.

A class of predictors $\mathcal{C}$ is said to have a finite VC-dimension $V_{\mathcal{C}}$ if the VC
dimension of the collection of sets $\{A_{\phi,t} : \phi \in \mathcal{C}, t \in [0,1]\}$ is equal to
$V_{\mathcal{C}}$, where $A_{\phi,t} = \{(x,y) : L(y,\phi(x)) > t\}$.
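For small examples, the quantity $N_{\mathcal{A}}(z_1,\dots,z_n)$ can be counted by brute force. The sketch below is illustrative: sets are represented by membership predicates, and the two classes used (half-lines and closed intervals of the real line) as well as the function names are assumptions chosen for the example. It counts traces on one given point configuration; the shatter coefficient itself is the maximum over all configurations, which for these two classes is reached by any set of distinct points.

```python
def trace_count(points, sets):
    """Number of distinct subsets of `points` picked out by the sets (N_A)."""
    return len({frozenset(z for z in points if contains(z)) for contains in sets})


def half_lines(thresholds):
    """Membership predicates of the half-lines (-inf, a] for a in thresholds."""
    return [lambda z, a=a: z <= a for a in thresholds]


def intervals(endpoints):
    """Membership predicates of the closed intervals [a, b] with a <= b."""
    return [lambda z, a=a, b=b: a <= z <= b
            for a in endpoints for b in endpoints if a <= b]
```

Half-lines pick out 4 of the 8 subsets of three distinct points, so three points are not shattered. Intervals pick out all 4 subsets of two points but only 7 of the 8 subsets of three points, which matches a VC dimension of 2 for intervals.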



4. Results 

4.1 Hypotheses H 

In the sequel, we suppose that the training sample and the test sample are disjoint and that the
numbers of observations in the training sample and in the test sample are respectively $n(1-p_n)$
and $np_n$. Moreover, we suppose that the predictor $\phi_n$ is an empirical risk minimizer over a
class $\mathcal{C}$ with finite VC-dimension $V_{\mathcal{C}}$ and that $L$ is a loss function bounded
by 1. We also suppose that the predictors are symmetric with respect to the training sample, i.e. the
predictor does not depend on the order of the observations in $\mathcal{D}_n$. Finally, the
cross-validation procedures are symmetric, i.e. $\Pr(V_{n,i}^{tr} = 1)$ does not depend on $i$; this
excludes the hold-out cross-validation. We denote these hypotheses by $\mathcal{H}$.

We will show upper bounds of the kind
$\Pr(|\widehat{R}_{CV} - \widetilde{R}_n| \ge \varepsilon) \le B(n,p_n,\varepsilon) + V(n,p_n,\varepsilon)$
with $\varepsilon > 0$. The term $B(n,p_n,\varepsilon)$ is a Vapnik-Chervonenkis-type bound whereas
the term $V(n,p_n,\varepsilon)$ is a Hoeffding-like term controlled by the size of the test sample
$np_n$. This bound can be interpreted as a quantitative answer to the bias-variance trade-off
question. As the percentage of observations in the test sample $p_n$ increases, the
$V(n,p_n,\varepsilon)$ term decreases but the $B(n,p_n,\varepsilon)$ term increases. Notice that this
bound is worse than the Vapnik-Chervonenkis-type bound and thus can be called a "sanity-check bound"
in the spirit of Kearns et al. (1999). Even though these bounds are valid for almost all the
cross-validation procedures, their relevance depends highly on the percentage $p_n$ of elements in
the test sample; this is why we first classify them according to $p_n$. Finally, notice that our
bounds can be refined using chaining arguments. However, this is not the purpose of this paper.

4.2 Cross-validation with large test samples 

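The first result deals with large test samples, i.e. the bounds are all the better if $np_n$ is
large. Note that this result excludes the hold-out cross-validation because it does not make a
symmetric use of the data.

Proposition 4.1 (Large test sample) Suppose that $\mathcal{H}$ holds. Then, we have for all
$\varepsilon > 0$,

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_n \ge \varepsilon) \le B(n,p_n,\varepsilon) + V(n,p_n,\varepsilon),$$

with

• $B(n,p_n,\varepsilon) = 4\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{25}\right)$,

• $V(n,p_n,\varepsilon) = \exp\left(-\frac{2np_n\varepsilon^2}{25}\right)$.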

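First, we begin with a useful lemma (for the proof, see Appendices).

Lemma 4.1 Under the assumptions of Proposition 4.1, we have for all $\varepsilon > 0$,

$$\Pr\Bigl(E_{V_n^{tr}} \sup_{\phi \in \mathcal{C}} (\widehat{R}_{V_n^{tr}}(\phi) - R(\phi)) \ge \varepsilon\Bigr) \le (S(2n(1-p_n),\mathcal{C}))^{\frac{4}{1-p_n}} e^{-n\varepsilon^2},$$

and symmetrically

$$\Pr\Bigl(E_{V_n^{tr}} \sup_{\phi \in \mathcal{C}} (R(\phi) - \widehat{R}_{V_n^{tr}}(\phi)) \ge \varepsilon\Bigr) \le (S(2n(1-p_n),\mathcal{C}))^{\frac{4}{1-p_n}} e^{-n\varepsilon^2}.$$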



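Proof of Proposition 4.1

Recall that $\phi_n$ is based on empirical risk minimization. Moreover, for simplicity, we have
supposed the infimum is attained, i.e. $\phi_n = \arg\min_{\phi \in \mathcal{C}} \widehat{R}_n(\phi)$.
Define $\widetilde{R}_{n(1-p_n)} := E_{V_n^{tr}} R(\phi_{V_n^{tr}})$.

We have, by splitting according to $\widetilde{R}_{n(1-p_n)}$:

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_n \ge 5\varepsilon) \le \Pr(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} \ge \varepsilon) + \Pr(\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n \ge 4\varepsilon).$$

We denote the first term on the right-hand side by $V$ and the second by $B$. Notice that
$E_{\mathcal{D}_n}(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)}) = 0$. Intuitively, $V$ corresponds to
the variance term and is controlled in some way by the resampling plan. On the contrary, in the
general setting, $E_{\mathcal{D}_n}(\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n) \ne 0$, and $B$ is the
bias term and measures the discrepancy between the error rate of size $n$ and of size $n(1-p_n)$.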

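The first term $V$ can be bounded via Hoeffding's inequality, as follows:

$$V = \Pr\bigl(E_{V_n^{tr}}(\widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) - R(\phi_{V_n^{tr}})) \ge \varepsilon\bigr)
\le \inf_{s>0} e^{-s\varepsilon}\, E\, e^{s E_{V_n^{tr}}(\widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) - R(\phi_{V_n^{tr}}))} \quad \text{(by Chernoff's bound).}$$

Then, by Jensen's inequality, we have

$$V \le \inf_{s>0} e^{-s\varepsilon}\, E_{\mathcal{D}_n} E_{V_n^{tr}}\, e^{s(\widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) - R(\phi_{V_n^{tr}}))}.$$

Thus, for $v^{tr}, v^{ts}$ fixed vectors, we have by linearity of expectation and the i.i.d.
assumption

$$V \le \inf_{s>0} e^{-s\varepsilon}\, E\, e^{s(\widehat{R}_{v^{ts}}(\phi_{v^{tr}}) - R(\phi_{v^{tr}}))}
\le \inf_{s>0} e^{-s\varepsilon}\, E_{\mathcal{D}_{v^{tr}}} E\bigl(e^{s(\widehat{R}_{v^{ts}}(\phi_{v^{tr}}) - R(\phi_{v^{tr}}))} \bigm| \mathcal{D}_{v^{tr}}\bigr).$$

Finally, by Lemma 1 in Lugosi (2003), since
$E(\widehat{R}_{v^{ts}}(\phi_{v^{tr}}) - R(\phi_{v^{tr}}) \mid \mathcal{D}_{v^{tr}}) = 0$ and by
conditional independence:

$$V \le \inf_{s>0} e^{-s\varepsilon}\, e^{\frac{s^2}{8np_n}} \le e^{-2np_n\varepsilon^2}.$$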
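The last inequality is the usual optimisation of a Chernoff exponent. As a sketch, assuming the moment-generating function of the centered test error is bounded by $e^{s^2/(8m)}$, where $m$ stands for the test-sample size $np_n$ (Hoeffding's lemma for an average of $m$ terms with values in $[0,1]$):

```latex
% Minimise g(s) = -s*eps + s^2/(8m) over s > 0, with m = n p_n.
\begin{align*}
g(s) &= -s\varepsilon + \frac{s^2}{8m}, \qquad
g'(s) = -\varepsilon + \frac{s}{4m} = 0 \iff s = 4m\varepsilon, \\
g(4m\varepsilon) &= -4m\varepsilon^2 + 2m\varepsilon^2 = -2m\varepsilon^2,
\qquad\text{so}\qquad \inf_{s>0} e^{g(s)} = e^{-2m\varepsilon^2}.
\end{align*}
```

With $m = np_n$ this gives the exponent $-2np_n\varepsilon^2$, which depends on the test-sample size only.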

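The second term may be treated by introducing the optimal error $R_{opt}$, which should be close to
$\widetilde{R}_n$:

$$B = \Pr(\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n \ge 4\varepsilon)$$

$$= \Pr\bigl(E_{V_n^{tr}}(R(\phi_{V_n^{tr}}) - \widehat{R}_{V_n^{tr}}(\phi_{V_n^{tr}})) + E_{V_n^{tr}}(\widehat{R}_{V_n^{tr}}(\phi_{V_n^{tr}}) - R_{opt}) + R_{opt} - \widetilde{R}_n \ge 4\varepsilon\bigr).$$

Using the supremum and the fact that $\phi_{V_n^{tr}}$ is an empirical risk minimizer, we obtain:

$$B \le \Pr\Bigl(E_{V_n^{tr}} \sup_{\phi \in \mathcal{C}}(R(\phi) - \widehat{R}_{V_n^{tr}}(\phi)) + E_{V_n^{tr}}\bigl(\inf_{\phi \in \mathcal{C}} \widehat{R}_{V_n^{tr}}(\phi) - \inf_{\phi \in \mathcal{C}} R(\phi)\bigr) + R_{opt} - \widehat{R}_n + \widehat{R}_n - \widetilde{R}_n \ge 4\varepsilon\Bigr).$$

Then, since $\inf(A) - \inf(B) \le \sup(A - B)$ and by definition of $\phi_n$, we deduce

$$B \le \Pr\Bigl(E_{V_n^{tr}} \sup_{\phi \in \mathcal{C}}(R(\phi) - \widehat{R}_{V_n^{tr}}(\phi)) \ge \varepsilon\Bigr) + \Pr\Bigl(E_{V_n^{tr}} \sup_{\phi \in \mathcal{C}}(\widehat{R}_{V_n^{tr}}(\phi) - R(\phi)) \ge \varepsilon\Bigr)$$

$$+ \Pr\Bigl(\sup_{\phi \in \mathcal{C}}(R(\phi) - \widehat{R}_n(\phi)) \ge \varepsilon\Bigr) + \Pr\Bigl(\sup_{\phi \in \mathcal{C}}(\widehat{R}_n(\phi) - R(\phi)) \ge \varepsilon\Bigr).$$

Thus, by Lemma 4.1, we get

$$B \le 2\,(S(2n(1-p_n),\mathcal{C}))^{\frac{4}{1-p_n}} e^{-n\varepsilon^2} + 2\,S(2n,\mathcal{C})^{4} e^{-n\varepsilon^2}.$$

Recall the following result (see e.g. Devroye et al. (1996)):

$$\forall n, \quad S(n,\mathcal{C}) \le (n+1)^{V_{\mathcal{C}}}. \qquad (1)$$

Thus, we finally obtain

$$B \le 2\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} e^{-n\varepsilon^2} + 2\,(2n+1)^{4V_{\mathcal{C}}} e^{-n\varepsilon^2}
\le 4\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} e^{-n\varepsilon^2}.$$

□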
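The final step of this proof compares a polynomial factor in $2n+1$ with one in $2n(1-p_n)+1$ raised to a larger power. One way to justify such a comparison, given here as a sketch with $p$ standing for $p_n$ and $V$ for $V_{\mathcal{C}}$, uses Bernoulli's inequality $(1+x)^r \ge 1 + rx$ for $x \ge 0$ and $r \ge 1$:

```latex
% Take x = 2n(1-p) and r = 1/(1-p) >= 1 in Bernoulli's inequality.
\begin{equation*}
\bigl(1 + 2n(1-p)\bigr)^{\frac{1}{1-p}} \;\ge\; 1 + \frac{2n(1-p)}{1-p} \;=\; 2n+1
\quad\Longrightarrow\quad
(2n+1)^{4V} \;\le\; \bigl(2n(1-p)+1\bigr)^{\frac{4V}{1-p}} .
\end{equation*}
```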

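Next, we obtain

Proposition 4.2 (Large test sample) Suppose that $\mathcal{H}$ holds. Then, we have, for all
$\varepsilon > 0$,

$$\Pr(\widetilde{R}_n - \widehat{R}_{CV} \ge \varepsilon) \le (2n+1)^{4V_{\mathcal{C}}} \exp(-n\varepsilon^2).$$

Proof

First, the following lemma holds (for the proof, see Appendices).

Lemma 4.2 Suppose that $\mathcal{H}$ holds. Then we have $\widehat{R}_{CV} \ge \widehat{R}_n$.

Thus,

$$\Pr(\widetilde{R}_n - \widehat{R}_{CV} \ge \varepsilon) \le \Pr(\widetilde{R}_n - \widehat{R}_n \ge \varepsilon) \le S(2n,\mathcal{C})^{4} e^{-n\varepsilon^2} \le (2n+1)^{4V_{\mathcal{C}}} e^{-n\varepsilon^2}.$$

□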

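Using the two previous results, we have a concentration inequality for the absolute error
$|\widehat{R}_{CV} - \widetilde{R}_n|$.

Corollary 4.1 (Absolute error for large test sample) Suppose that $\mathcal{H}$ holds. Then, we have,
for all $\varepsilon > 0$,

$$\Pr(|\widetilde{R}_n - \widehat{R}_{CV}| \ge \varepsilon) \le B(n,p_n,\varepsilon) + V(n,p_n,\varepsilon),$$

with

• $B(n,p_n,\varepsilon) = 5\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{25}\right)$,

• $V(n,p_n,\varepsilon) = \exp\left(-\frac{2np_n\varepsilon^2}{25}\right)$.

With the previous concentration inequality, we can bound from above the expectation of
$|\widetilde{R}_n - \widehat{R}_{CV}|$.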



Corollary 4.2 ($L_1$ error for large test sample) Suppose that $\mathcal{H}$ holds. Then, we have,

Proof.

This is a direct consequence of the following lemma:

Lemma 4.3 (Devroye et al. (1996)) Let $X$ be a nonnegative random variable. Let $K, C$ be nonnegative
reals such that $C \ge 1$. Suppose that for all $\varepsilon > 0$,
$P(X \ge \varepsilon) \le C \exp(-K\varepsilon^2)$. Then:

$$E X \le \sqrt{\frac{\ln(C) + 2}{K}}.$$



□ 



4.3 Cross-validation with small test samples 

The previous bound is not relevant for all small test samples (typically leave-one-out
cross-validation), since we are not assured that the variance term converges to 0 (in leave-one-out
cross-validation, $V(n,p_n,\varepsilon) = \exp(-2\varepsilon^2/25)$). However, under $\mathcal{H}$,
cross-validation with small test samples also works, as stated in the next proposition.

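Proposition 4.3 (Small test sample) Suppose that $\mathcal{H}$ holds. Then, we have, for all
$\varepsilon > 0$,

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_n \ge \varepsilon) \le B(n,p_n,\varepsilon) + V(n,p_n,\varepsilon),$$

with

• $B(n,p_n,\varepsilon) = 4\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$,

• $V(n,p_n,\varepsilon) = \frac{16}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-p_n)+1)+4)}{n(1-p_n)}}$.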

For small test samples, we get the same conclusion but the rate of convergence for the term $V$ is
slower than for large test samples: typically
$O_n\left(\sqrt{V_{\mathcal{C}}\ln(n(1-p_n))/(n(1-p_n))}\right)$ against
$O_n\left(\exp(-2np_n\varepsilon^2/25)\right)$.

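Proof.

Now, we get by splitting according to $\widetilde{R}_{n(1-p_n)}$:

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_n \ge 8\varepsilon) \le \Pr(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} \ge 4\varepsilon) + \Pr(\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n \ge 4\varepsilon).$$

First, from the proof of Proposition 4.1, we have
$B \le 4\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} e^{-n\varepsilon^2}$.

Secondly, notice that $E(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)}) = 0$. To control $V$, we will
need the following lemma (for the proof, see Appendices), which says that if a bounded, centered
random variable $X$ takes markedly negative values only with small probability, then it takes
markedly positive values only with small probability as well.

Lemma 4.4 If $|X| \le 1$ and $EX = 0$, then for all $\varepsilon > 0$, we get

$$P(X \ge \varepsilon) \le \frac{1}{\varepsilon}\int_0^1 P(X \le -x)\,dx.$$

Moreover, since $\widehat{R}_{CV} \ge \widehat{R}_n$ by Lemma 4.2, we have

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} \le -4\varepsilon) \le \Pr(\widehat{R}_n - \widetilde{R}_{n(1-p_n)} \le -4\varepsilon)$$

$$\le \Pr(\widehat{R}_n - \widetilde{R}_n \le -\varepsilon) + \Pr(\widetilde{R}_n - \widetilde{R}_{n(1-p_n)} \le -3\varepsilon).$$

Using Lemma 4.1, it follows:

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} \le -4\varepsilon) \le S(2n,\mathcal{C})^{4} e^{-n\varepsilon^2} + 3\,S(2n(1-p_n),\mathcal{C})^{\frac{4}{1-p_n}} e^{-n\varepsilon^2}
\le 4\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} e^{-n\varepsilon^2}.$$

Applying Lemma 4.4 and inequality (1) allows us to conclude.
□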
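The paper defers the proof of the lemma on bounded centered variables to the appendices. One short way to see why a bound of the form $P(X \ge \varepsilon) \le \frac{1}{\varepsilon}\int_0^1 P(X \le -x)\,dx$ holds when $|X| \le 1$ and $EX = 0$ is sketched below; this is our illustration and not necessarily the appendix argument. Here $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$.

```latex
% Centering transfers mass from the negative part to the positive part; then Markov.
\begin{align*}
EX = 0 &\implies E X^+ = E X^- = \int_0^1 P(X^- \ge x)\,dx = \int_0^1 P(X \le -x)\,dx, \\
P(X \ge \varepsilon) &= P(X^+ \ge \varepsilon) \le \frac{E X^+}{\varepsilon}
  = \frac{1}{\varepsilon}\int_0^1 P(X \le -x)\,dx .
\end{align*}
```

The polynomial $1/\varepsilon$ factor in the small-test-sample bound comes from this Markov step.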

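We have the following complementary but not symmetrical result:

Proposition 4.4 (Small test sample bis) Suppose that $\mathcal{H}$ holds. Then, we have for all
$\varepsilon > 0$,

$$\Pr(\widetilde{R}_n - \widehat{R}_{CV} \ge \varepsilon) \le (2n+1)^{4V_{\mathcal{C}}} \exp(-n\varepsilon^2).$$

Proof.

Since $\widehat{R}_{CV} \ge \widehat{R}_n$, we have

$$\Pr(\widetilde{R}_n - \widehat{R}_{CV} \ge \varepsilon) \le \Pr(\widetilde{R}_n - \widehat{R}_n \ge \varepsilon) \le S(2n,\mathcal{C})^{4} e^{-n\varepsilon^2} \le (2n+1)^{4V_{\mathcal{C}}} e^{-n\varepsilon^2}.$$

□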

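From this result, we deduce that

Corollary 4.3 (Absolute error for small test sample) Suppose that $\mathcal{H}$ holds. Then, we have
for all $\varepsilon > 0$,

$$\Pr(|\widetilde{R}_n - \widehat{R}_{CV}| \ge \varepsilon) \le B(n,p_n,\varepsilon) + V(n,p_n,\varepsilon),$$

with

• $B(n,p_n,\varepsilon) = 5\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$,

• $V(n,p_n,\varepsilon) = \frac{16}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-p_n)+1)+4)}{n(1-p_n)}}$.

Finally, we get

Corollary 4.4 ($L_1$ error for small test sample) Suppose that $\mathcal{H}$ holds. Then, we have: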



n{l~Pn) \ \ V K:(ln(2n(l-p„) + l)+4) 
Proof.

We just need Lemma 4.3 and the following simple lemma:

Lemma 4.5 Let $X$ be a nonnegative random variable bounded by 1, and $A > 0$ a real such that
$P(X \ge \varepsilon) \le \frac{A}{\varepsilon}$ for all $\varepsilon > 0$. Then,

$$E(X) \le A\,(1 - \ln(A)).$$

□
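A bound of this type follows from integrating the tail. The sketch below additionally assumes $0 < A \le 1$ (an assumption made for this illustration), so that the trivial bound $P(X \ge \varepsilon) \le 1$ is used for $\varepsilon \le A$ and the polynomial tail $A/\varepsilon$ for $\varepsilon > A$:

```latex
% Tail integration for a variable with values in [0, 1] and tail A / eps.
\begin{equation*}
E X = \int_0^1 P(X \ge \varepsilon)\,d\varepsilon
\le \int_0^A 1\,d\varepsilon + \int_A^1 \frac{A}{\varepsilon}\,d\varepsilon
= A + A\ln\frac{1}{A} = A\,(1 - \ln A).
\end{equation*}
```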

Collecting the previous results, we can summarize the upper bounds in probability in the following
theorem:

Theorem 4.5 (Absolute error for cross-validation) Suppose that $\mathcal{H}$ holds. Then, we have for
all $\varepsilon > 0$,

$$\Pr(|\widetilde{R}_n - \widehat{R}_{CV}| \ge \varepsilon) \le B_{sym}(n,p_n,\varepsilon) + V_{sym}(n,p_n,\varepsilon),$$

with

• $B_{sym}(n,p_n,\varepsilon) = 5\,(2n(1-p_n)+1)^{\frac{4V_{\mathcal{C}}}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$

• $V_{sym}(n,p_n,\varepsilon) = \min\left(\exp\left(-\frac{2np_n\varepsilon^2}{25}\right), \frac{16}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-p_n)+1)+4)}{n(1-p_n)}}\right)$.

An interesting consequence of this theorem is that the size of the test sample is not required to
grow to infinity for the consistency of the cross-validation procedure in terms of convergence in
probability.

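4.4 $k$-fold cross-validation

For $k$-fold cross-validation, we can simply use the previous bounds together. Thus, we get

Proposition 4.6 ($k$-fold) Suppose that $\mathcal{H}$ holds. Then, we have for all $\varepsilon > 0$,

$$\Pr(|\widetilde{R}_n - \widehat{R}_{CV}| \ge \varepsilon) \le B_k(n,p_n,\varepsilon) + V_k(n,p_n,\varepsilon)$$

with

• $B_k(n,p_n,\varepsilon) = 5\,(2n(1-1/k)+1)^{\frac{4V_{\mathcal{C}}}{1-1/k}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$

• $V_k(n,p_n,\varepsilon) = \min\left(\exp\left(-\frac{2n\varepsilon^2}{25k}\right), \frac{16}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-1/k)+1)+4)}{n(1-1/k)}}\right)$.

Since $k \ge 2$, notice the previous bound can itself be bounded by

$$5\,(2n+1)^{8V_{\mathcal{C}}} \exp\left(-\frac{n\varepsilon^2}{64}\right) + \min\left(\exp\left(-\frac{2n\varepsilon^2}{25k}\right), \frac{16}{\varepsilon}\sqrt{\frac{2V_{\mathcal{C}}(\ln(2n+1)+4)}{n}}\right).$$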

In fact, the bound for the variance term $V$ can be improved by averaging the $k$ observed errors.
This step emphasizes the interest of $k$-fold cross-validation against simpler cross-validation.

Proposition 4.7 ($k$-fold) Suppose that $\mathcal{H}$ holds. Then, in the case of the $k$-fold
cross-validation procedure, we have for all $\varepsilon > 0$:

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} \ge \varepsilon) \le 2^{\frac{1}{p_n}} \exp\left(-\frac{n\varepsilon^2}{64\left(\sqrt{V_{\mathcal{C}}\ln(2(2np_n+1))}+2\right)}\right).$$

Thus, averaging the observed errors to form the $k$-fold estimate improves the term $V_k$ from

$$\min\left(2\exp\left(-\frac{32np_n\varepsilon^2}{49}\right), \frac{14}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-p_n)+1)+4)}{n(1-p_n)}}\right)$$

to $2^{\frac{1}{p_n}} \exp\left(-\frac{n\varepsilon^2}{64\left(\sqrt{V_{\mathcal{C}}\ln(2(2np_n+1))}+2\right)}\right)$.
This result is important since it shows why intensive use of the data can be very fruitful to improve
the estimation rate. Another interesting consequence of this proposition is that, for a fixed
precision $\varepsilon$, the size of the test sample is not required to grow to infinity for the
exponential convergence of the cross-validation procedure. For this, it is sufficient that the size
of the test sample is larger than a fixed number $n_0$.

Proof.

Recall that the size of the training sample is $n(1-p_n)$, and the size of the test sample is then
$np_n$. For this proposition, we have $p_n \le \frac{1}{2}$.

We are interested in the behaviour of
$\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} = E_{V_n^{tr}}\widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) - E_{V_n^{tr}} R(\phi_{V_n^{tr}})$,
which is a sum of $\frac{1}{p_n} = k$ terms in the case of the $k$-fold cross-validation.

The difficulty is that these terms are neither independent, nor even exchangeable. We have in mind
to apply the results about the sum of independent random variables. For this, we need a way
to introduce independence in our samples. At the same time, we do not want to lose too much
information. For this, we will introduce independence by using the supremum. We have,

$$\Pr(\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)} \ge \varepsilon) = \Pr\bigl(E_{V_n^{tr}}(\widehat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) - R(\phi_{V_n^{tr}})) \ge \varepsilon\bigr)
\le \Pr\Bigl(E_{V_n^{tr}} \sup_{\phi \in \mathcal{C}}(\widehat{R}_{V_n^{ts}}(\phi) - R(\phi)) \ge \varepsilon\Bigr).$$

Now, we have a sum of $k = \frac{1}{p_n}$ i.i.d. terms: $\Pr\bigl(\frac{1}{k}\sum_{i=1}^{k} Y_i \ge \varepsilon\bigr)$
with $Y_i = \sup_{\phi \in \mathcal{C}}(\widehat{R}_{v_i^{ts}}(\phi) - R(\phi))$.
However, we have an extra piece of information: an upper bound for the tail probability of these
variables, using the concentration inequality due to Vapnik (1998):

$$\Pr\Bigl(\sup_{\phi \in \mathcal{C}}(\widehat{R}_{v^{ts}}(\phi) - R(\phi)) \ge \varepsilon\Bigr) \le c(np_n, V_{\mathcal{C}})\, e^{-\frac{\varepsilon^2}{2\sigma(np_n)^2}},$$

with $c(n,V_{\mathcal{C}}) = 2S(2n,\mathcal{C}) \le 2(2n+1)^{V_{\mathcal{C}}}$ and $\sigma(n)^2$
proportional to $1/n$.

In fact, summing independent bounded variables with exponentially small tail probability gives us
a better concentration inequality than the simple sum of independent bounded variables.
To show this, we proceed in three steps:

1. the $q$-Hölder norms of each variable are uniformly bounded, at a rate of order $\sqrt{q}$,

2. the Laplace transform of $Y_i$ is smaller than the Laplace transform of some particular normal
variable,

3. using Chernoff's method, we obtain a sharp concentration inequality.
1. First step (for the proof, see Appendices): we prove

Lemma 4.6 Let $Y$ be a random variable (bounded by 1) with subgaussian tail probability
$P(Y \ge \varepsilon) \le c\, e^{-\frac{\varepsilon^2}{2\sigma^2}}$ for all $\varepsilon > 0$, with
$\sigma > 0$ and $c \ge 2$. Then, there exists a constant $\gamma$ such that,
for every integer $q$,

with 7 = ((T-\/4 ln(c) + 7r3332e^5cr)^. 



2. Second step (see Exercise 4 in Lugosi (2003)): we have



Lemma 4.7 If there exists a constant $\gamma$ such that for every integer $q$
then we have

E{e'^) < V2eh'^. 
3. Third step: we have the result using Chernoff's method.
Lemma 4.8 If, for some $\alpha > 0$, $\beta > 0$, we have:

E(e"^) < ae^ 

then, if $(Y_i)_{1 \le i \le n}$ are i.i.d., we have:

P(-^y, > e) < a^e^ 

i=l 



Putting Lemmas 4.6, 4.7 and 4.8 together, we finally get:



p" exp 



0ec " y2(T(np„)2(e2^41n(c(np„, Vc)) +7r4332)2y 

□ 

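Symmetrically, we obtain:

Proposition 4.8 ($k$-fold bis) Suppose that $\mathcal{H}$ holds. Then, in the case of the $k$-fold
cross-validation procedure, we have for all $\varepsilon > 0$,

$$P(\widetilde{R}_{n(1-p_n)} - \widehat{R}_{CV} \ge \varepsilon) \le 2^{\frac{1}{p_n}} \exp\left(-\frac{n\varepsilon^2}{64\left(\sqrt{V_{\mathcal{C}}\ln(2(2np_n+1))}+2\right)}\right).$$

Finally, we have a control on the absolute deviation.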

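Theorem 4.9 (Absolute error for the $k$-fold) Suppose that $\mathcal{H}$ holds. Then, in the case of
the $k$-fold cross-validation procedure, we have for all $\varepsilon > 0$,

$$\Pr(|\widetilde{R}_n - \widehat{R}_{CV}| \ge \varepsilon) \le B_k(n,p_n,\varepsilon) + V_k(n,p_n,\varepsilon)$$

with

• $B_k(n,p_n,\varepsilon) = 5\,(2n(1-1/k)+1)^{\frac{4V_{\mathcal{C}}}{1-1/k}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$

• $V_k(n,p_n,\varepsilon) = \min\left(\exp\left(-\frac{2n\varepsilon^2}{25k}\right),\; \frac{16}{\varepsilon}\sqrt{\frac{V_{\mathcal{C}}(\ln(2n(1-1/k)+1)+4)}{n(1-1/k)}},\; 2 \cdot 2^{\frac{1}{p_n}} \exp\left(-\frac{n\varepsilon^2}{25 \cdot 64\left(\sqrt{V_{\mathcal{C}}\ln(2(2np_n+1))}+2\right)}\right)\right)$.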

4.5 Hold-out cross-validation

For hold-out cross-validation, the symmetry condition that, for all $i$,
$\Pr(i \in \mathcal{D}_{V_n^{tr}})$ is independent of $i$ is no longer valid. Indeed, in the hold-out
cross-validation (or split sample), there is no crossing.

In the next proposition, we suppose that the training sample and the test sample are disjoint and
that the numbers of observations in the training sample and in the test sample are still respectively
$n(1-p_n)$ and $np_n$. Moreover, we suppose that the predictors $\phi_n$ are empirical risk minimizers
on a class $\mathcal{C}$ with finite VC-dimension $V_{\mathcal{C}}$ and that $L$ is a loss function
bounded by 1. We denote these hypotheses by $\mathcal{Q}$.

We get the following result 

Theorem 4.10 (Hold-out) Suppose that Q holds. Then, we have for all e > 0, 
Pr(|_R„ - Rcv\ > e) < Bh.oid{n,Pn,e) + Vhoid{n,Pn,e) 

with 

• BhoM{n,Pn,e) = 8(2n(l -p„) + if""- exp(-?^^^i^^^) 

• Vhoid{n.,Pn,£) = 2exp( ^ — ). 

25 

Proof. We just have to follow the same steps as in Proposition 4.5. But in the case of hold-out
cross-validation, notice that



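$$\Pr\Big(E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) \ge \epsilon\Big) = \Pr\Big(\sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) \ge \epsilon\Big) \le S(2n(1-p_n), C)\, e^{-n(1-p_n)\epsilon^2}.$$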



Moreover, Lemma 4.4 is no longer valid, since $E_{V_n^{tr}} \hat{R}_{V_n^{ts}}(\phi_n) \neq \hat{R}_n$. □



4.6 Discussion 

We base the next discussion on upper bounds, so the following heuristic arguments are questionable
if the bounds are loose.
Crossing versus non-crossing



One may wonder what the use is of averaging again over the different folds of the k-fold
cross-validation, which is time consuming. As far as the expected errors are concerned, the upper bounds
are the same for crossing cross-validation procedures and for hold-out cross-validation. But suppose
we are given a level of precision $\epsilon$, and we want to find an interval of length $2\epsilon$ with maximal

confidence. Then notice that Bgy^i/ B^oij^ = (2n(l — Pn) + 1) i-P" exp(— np„£^). Thus if p„ is 
constant, $B_{sym}/B_{hold} \to 0$ as $n \to \infty$: the term B will be much greater for hold-out based on a large
learning size. On the contrary, if the learning size is small, then the term B is smaller for the
non-crossing procedure for a given $p_n$. This might be due to the absence of resampling.
Regarding the variance term $V_{hold}(n,p_n,\epsilon)$, we need the size of the test sample to grow to infinity
for the consistency of the hold-out cross-validation. On the contrary, for crossing cross-validation,
the term V converges to 0 whatever the size of the test sample is.



k-FOLD CROSS-VALIDATION VERSUS OTHERS

If we consider the $L_1$ error, the upper bounds are the same for crossing cross-validation procedures
and for other cross-validation procedures. But if we look for the interval of length $2\epsilon$ with maximal
confidence, then notice that $V_k/V_{sym} \to 0$ as $n \to \infty$ (with $V_k$ and $V_{sym}$ defined respectively in
Theorems 4.9 and 4.5) if the number of elements in the training sample $np_n$ is constant and large enough.
Thus, if the learning size is large enough, the V term is much smaller for the k-fold cross-validation,
thanks to the crossing.



Estimation curve 

The expression of the variance term V depends on the percentage of observations $p_n$ in the test
sample and on the type of cross-validation procedure. We thus have a control of the variance term
depending on $p_n$.
We can define the estimation curve (in probability or in $L_1$ norm), which gives, for each cross-validation
procedure and for each $p_n$, the estimation error.

Definition 4.1 (Estimation curve in probability) Let $\epsilon > 0$;

$$AC : p_n \mapsto B(n,p_n,\epsilon) + V(n,p_n,\epsilon),$$

with $B(n,p_n,\epsilon)$ and $V(n,p_n,\epsilon)$ defined in Theorem 4.5.



This can be done with the expectation of the absolute deviation or with the probability upper
bound if the level of precision is $\epsilon$.

Definition 4.2 (Estimation curve in $L_1$ norm)

$$AC : p_n \mapsto B(n,p_n) + V(n,p_n),$$

with $B(n,p_n)$ and $V(n,p_n)$ defined as in Proposition 4.2.



We say that the estimation curve in probability experiences a phase transition when the convergence
rate of $V(n,p_n,\epsilon)$ changes. The estimation curve experiences at least one transition phase. The
transition phases depend only on the class of predictors and on the sample size. Contrary to
the learning curve, the transition phases of the estimation curve are independent of the underlying
distribution. The different transition phases define three different regions in the values of $p_n$, the
percentage of observations in the test sample. These three regions emphasize the different roles played
by small test sample cross-validation, large test sample cross-validation and k-fold cross-validation.

Optimal splitting and confidence intervals 

The estimation curve gives a hint for this simple but important question: how should one choose the
cross-validation procedure in order to get the best estimation rate? How should one choose k in the
k-fold cross-validation? The quantitative answer to these questions is the arg min of the estimation
curve AC.

That is, in probability,

$$p_n^*(\epsilon) = \arg\min_{p_n} AC(p_n, \epsilon),$$

or in $L_1$ norm:

$$p_n^* = \arg\min_{p_n} AC(p_n).$$
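
As an illustration of this arg min, here is a minimal sketch that minimizes an estimation curve over a grid of values of $p_n$. The curve used, $c\sqrt{V_C/(n(1-p))} + \sqrt{1/(np)}$, is only an assumed stand-in with the typical shape of an $L_1$ bound (one term decreasing in the training size, one decreasing in the test size); its constants are not those of the paper, and the names `l1_curve` and `argmin_on_grid` are hypothetical.

```python
import math

def l1_curve(p, n, vc_dim, c=1.0):
    """Assumed curve shape (not the paper's constants):
    c * sqrt(V_C / (n (1 - p))) + sqrt(1 / (n p))."""
    return c * math.sqrt(vc_dim / (n * (1.0 - p))) + math.sqrt(1.0 / (n * p))

def argmin_on_grid(curve, grid):
    """Return the grid point at which the curve is smallest."""
    return min(grid, key=curve)

grid = [i / 100 for i in range(1, 100)]  # candidate test proportions p_n
p_small_vc = argmin_on_grid(lambda p: l1_curve(p, n=1000, vc_dim=5), grid)
p_large_vc = argmin_on_grid(lambda p: l1_curve(p, n=1000, vc_dim=50), grid)
```

Under this assumed shape, a larger VC-dimension moves the minimizer towards smaller test proportions, that is, towards larger training samples.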

As far as the $L_1$ norm is concerned, we can derive a simple expression for the choice of $p_n$. Indeed,
we can use chaining arguments in the proof of Proposition 4.1, that is: there exists a universal constant
$c > 0$ such that $E \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) \le c\sqrt{\frac{V_C}{n(1-p_n)}}$ (for a proof, see e.g.
Devroye et al. (1996)). Proposition 4.2 thus becomes

Corollary 4.5 ($L_1$ error for large test sample) Suppose that H holds. Then, there exists a
universal constant $c > 0$ such that:



E\Rcv-Rn\ < 



n(l - pn) V np. 
We can then minimize the last expression inp„. After derivation, we obtain p* = + ^. 

Thus, the larger the VC-dimension is, the larger the training sample should be. Since it may be 
difficult to find an explicit constant, one may try to solve: ^ n"i -p^ '^)^'^ ^ \/ »p~ ' obtain then 
a computable rule p* = ((MMW)))i/3 + ^yi 

Another interesting issue is the following: knowing the number of observations n and the class of
predictors, we can now derive an optimal minimal $(1-\alpha)$-confidence interval, together with the
cross-validation procedure. We look at the values $(\epsilon, p_n)$ such that the upper bound
$B(n,p_n,\epsilon) + V(n,p_n,\epsilon)$ is below the threshold $\alpha$. Then, we select the couple $(\epsilon^*, p^*)$ among those
values for which $\epsilon$ is minimal. In Figure 1, we fix $\alpha = 5\%$. We observe that, for values of n between
1000 and 10000 and for small VC-dimension, a choice of $p \approx 10\%$, i.e. the ten-fold cross-validation,
seems to be a reasonable choice.
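
The selection of the couple $(\epsilon^*, p^*)$ can be sketched as a grid search. In the sketch below, `toy_bound` is a hypothetical stand-in for $B + V$ (a polynomial-times-exponential term in the training size plus a Hoeffding-type term in the test size); it does not reproduce the constants of the theorems, and any other upper bound with the same signature could be passed instead.

```python
import math

def toy_bound(n, p, eps, vc_dim):
    """Illustrative stand-in for B + V; NOT the constants of the theorems."""
    n_train, n_test = n * (1.0 - p), n * p
    b = math.exp(vc_dim * math.log(2.0 * n_train + 1.0) - n_train * eps ** 2)
    v = 2.0 * math.exp(-2.0 * n_test * eps ** 2)
    return b + v

def smallest_interval(n, vc_dim, alpha, eps_grid, p_grid, bound=toy_bound):
    """Return (eps*, p*): the smallest eps on the grid for which some p
    brings the bound below alpha, together with the p minimizing the bound."""
    for eps in sorted(eps_grid):
        p_best = min(p_grid, key=lambda p: bound(n, p, eps, vc_dim))
        if bound(n, p_best, eps, vc_dim) <= alpha:
            return eps, p_best
    return None  # no couple on the grid reaches the threshold

eps_grid = [i / 1000 for i in range(1, 501)]
p_grid = [i / 100 for i in range(1, 100)]
result = smallest_interval(10000, 2, 0.05, eps_grid, p_grid)
```

The returned couple depends entirely on the bound that is plugged in, so the output of this toy version should not be read as a recommendation for the splitting proportion.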



[Figure: Optimal confidence interval and sample splitting (alpha = 0.05)]

Figure 1: Upper bounds for cross-validation procedures with different splitting
5. Appendices

5.1 Notations and definitions 

We recall the main notations and definitions. 



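Name                        Notation                    Definition

Generalisation error        $\widetilde{R}_n$           $E_P[L(Y, \phi(X, D_n)) \mid D_n]$
Resubstitution estimate     $\hat{R}_n$                 $\frac{1}{n}\sum_{i=1}^n L(Y_i, \phi(X_i, D_n))$
Cross-validation estimate   $\hat{R}_{CV}$              $E_{V_n^{tr}} \hat{R}_{V_n^{ts}}(\phi_{V_n^{tr}})$
Cross-validation risk       $\widetilde{R}_{n(1-p)}$    $E_{V_n^{tr}} R(\phi_{V_n^{tr}})$
Optimal error               $R_{opt}$                   $\inf_{\phi \in C} R(\phi)$
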

Table 1: Main notations 



5.2 Proofs 

We recall three very useful results. The first one, due to Hoeffding (1963), bounds the difference
between the empirical mean and the expected value. The second one, due to Vapnik et al. (1971),
bounds the supremum over the class of predictors of the difference between the training error and
the generalization error. The last one is called the bounded differences inequality (McDiarmid (1989)).



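Theorem 5.1 (Hoeffding (1963)) Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$. Then
for all $\epsilon > 0$,

$$P\Big(\sum_{i=1}^n X_i - E\sum_{i=1}^n X_i \ge \epsilon\Big) \le e^{-2\epsilon^2 / \sum_{i=1}^n (b_i - a_i)^2}.$$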



Theorem 5.2 (Vapnik et al. (1971)) Let C be a class of predictors with finite VC-dimension and
L a loss function bounded by 1. Then for all $\epsilon > 0$,



P(sup(i?„(0) - L{(j,)) >e)< c(n, Vc)e 
(t>ec 

with c(n, Vc) = 25(2n,C;< 2(2n + lY<' and ifn> Vc, 2S{2n,C)< "^{^Y" and (7{n) 



2 _ 4 



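Theorem 5.3 (McDiarmid) Let $X_1, \dots, X_n$ be independent random variables taking values in a
set $\mathcal{X}$, and assume that $f : \mathcal{X}^n \to \mathbb{R}$ satisfies

$$\forall i, \quad \sup_{x_1,\dots,x_n,\,x_i'} |f(x_1,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n)| \le c_i.$$

Then, for all $\epsilon > 0$,

$$P\big(f(X_1,\dots,X_n) - E f(X_1,\dots,X_n) \ge \epsilon\big) \le e^{-2\epsilon^2 / \sum_{i=1}^n c_i^2}.$$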
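
As a sanity check of the bounded differences inequality, the sketch below takes $f$ to be the empirical mean of $n$ fair coin flips, for which changing one coordinate moves $f$ by at most $c_i = 1/n$, so that the bound reads $e^{-2n\epsilon^2}$. It then compares this bound with a simulated tail frequency. The choice of $f$, the simulation parameters, and the function names are assumptions made for illustration.

```python
import math
import random

def mcdiarmid_bound(eps, c):
    """Right-hand side exp(-2 eps^2 / sum_i c_i^2) of the inequality."""
    return math.exp(-2.0 * eps ** 2 / sum(ci ** 2 for ci in c))

def empirical_tail(n, eps, trials=2000, seed=0):
    """Simulated frequency of {mean - 1/2 >= eps} for n fair coin flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        # small tolerance so that floating-point rounding does not drop ties
        hits += (mean - 0.5 >= eps - 1e-12)
    return hits / trials

n, eps = 50, 0.1
bound = mcdiarmid_bound(eps, [1.0 / n] * n)  # equals exp(-2 n eps^2)
freq = empirical_tail(n, eps)
```

The simulated frequency stays below the bound, which is expected since the inequality is distribution-free and therefore not tight for any particular distribution.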



5.2.1 Proof of Lemma 4.1

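First, notice that

$$P\Big(E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) - E E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) \ge \epsilon\Big) \le e^{-2n\epsilon^2}$$

using McDiarmid's inequality by setting $f(X_1, \dots, X_n) = E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big)$ and since,
for all $i$,

$$\sup_{x_1,\dots,x_i,x_i',\dots,x_n} \Big| E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) - E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}'_{V_n^{tr}}(\phi) - R(\phi)\big) \Big|$$

$$\le \sup_{x_1,\dots,x_i,x_i',\dots,x_n} E_{V_n^{tr}} \Big| \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) - \sup_{\phi \in C}\big(\hat{R}'_{V_n^{tr}}(\phi) - R(\phi)\big) \Big| \quad \text{by Jensen's inequality}$$

$$\le \sup_{x_1,\dots,x_i,x_i',\dots,x_n} E_{V_n^{tr}} \sup_{\phi \in C} \big| \hat{R}_{V_n^{tr}}(\phi) - \hat{R}'_{V_n^{tr}}(\phi) \big| \quad \text{since } |\sup f - \sup g| \le \sup |f - g|$$

$$\le \frac{1}{n},$$

where $\hat{R}'_{V_n^{tr}}$ denotes the empirical risk on the training sample after $x_i$ has been replaced by $x_i'$.

Indeed, if we denote by $Q$ the number of elements in the sum $E_{V_n^{tr}}$, the change is at most
$\frac{1}{n(1-p_n)}$ for each training sample containing $i$, multiplied by the number of times $i$ is in the learning
sample, that is $\frac{1}{Q} \cdot \frac{1}{n(1-p_n)} \cdot Q(1-p_n) = \frac{1}{n}$.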

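Furthermore, we have

$$E E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) = E \sup_{\phi \in C}\big(\hat{R}_{v_n^{tr}}(\phi) - R(\phi)\big) \quad \text{with } v_n^{tr} \text{ a fixed vector}$$

$$\le \sqrt{\frac{2\ln(S(2n(1-p_n), C))}{n(1-p_n)}}$$

by the Vapnik-Chervonenkis inequality.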

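Thus, if we denote $P_1 = \Pr\big(E_{V_n^{tr}} \sup_{\phi \in C}(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)) \ge \epsilon\big)$, it leads to

$$P_1 = \Pr\Big(E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) - E E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) \ge \epsilon - E E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big)\Big).$$

Then, using the two previous inequalities,

$$P_1 \le \Pr\Big(E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) - E E_{V_n^{tr}} \sup_{\phi \in C}\big(\hat{R}_{V_n^{tr}}(\phi) - R(\phi)\big) \ge \epsilon - \sqrt{\frac{2\ln(S(2n(1-p_n), C))}{n(1-p_n)}}\Big).$$

Since $2(u - w)^2 \ge u^2 - 2w^2$, it follows

$$P_1 \le \exp\Big(-2n\Big(\epsilon - \sqrt{\frac{2\ln(S(2n(1-p_n), C))}{n(1-p_n)}}\Big)^2\Big) \le \exp\Big(-n\Big(\epsilon^2 - \frac{4\ln(S(2n(1-p_n), C))}{n(1-p_n)}\Big)\Big)$$

$$\le S(2n(1-p_n), C)^{4/(1-p_n)} \exp(-n\epsilon^2).$$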

□ 



5.2.2 Proof of Lemma 4.1

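Recall that $\hat{R}_{CV} = E_{V_n^{tr}} \hat{R}_{V_n^{ts}}(\phi_{V_n^{tr}})$.

But by definition of $\phi_n$, we have $\hat{R}_n(\phi_n) \le \hat{R}_n(\phi_{V_n^{tr}})$.

It follows that $\frac{1}{n}\big(np_n \hat{R}_{V_n^{ts}}(\phi_n) + \sum_{i \in V_n^{tr}} L(Y_i, \phi_n(X_i))\big) \le \frac{1}{n}\big(np_n \hat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) + \sum_{i \in V_n^{tr}} L(Y_i, \phi_{V_n^{tr}}(X_i))\big)$.
Thus, since $\sum_{i \in V_n^{tr}} L(Y_i, \phi_n(X_i)) \ge \sum_{i \in V_n^{tr}} L(Y_i, \phi_{V_n^{tr}}(X_i))$ by definition of $\phi_{V_n^{tr}}$, we have
$\hat{R}_{V_n^{ts}}(\phi_n) \le \hat{R}_{V_n^{ts}}(\phi_{V_n^{tr}})$.

From this, we deduce $\hat{R}_{CV} = E_{V_n^{tr}} \hat{R}_{V_n^{ts}}(\phi_{V_n^{tr}}) \ge E_{V_n^{tr}} \hat{R}_{V_n^{ts}}(\phi_n) = \hat{R}_n$.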
□ 



5.2.3 Proof of Lemma 4.4

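$$\forall \epsilon > 0, \quad P(X \ge \epsilon) \le P(X_+ \ge \epsilon) \le \frac{E X_+}{\epsilon} = \frac{E X + E X_-}{\epsilon} = \frac{E X + \int_0^\infty P(X_- \ge x)\,dx}{\epsilon} = \frac{E X + \int_0^\infty P(X \le -x)\,dx}{\epsilon}.$$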

□ 
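
The first step of this proof, the Markov-type bound $P(X \ge \epsilon) \le E X_+ / \epsilon$ on the positive part, can be checked numerically. The following minimal sketch does so on a simulated Gaussian sample; the distribution, the sample size and the helper name `positive_part_bound` are assumptions chosen only for illustration.

```python
import random

def positive_part_bound(sample, eps):
    """Return (empirical P(X >= eps), empirical E[X_+] / eps)."""
    n = len(sample)
    tail = sum(x >= eps for x in sample) / n
    mean_pos = sum(max(x, 0.0) for x in sample) / n
    return tail, mean_pos / eps

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(10000)]
# one (tail, bound) pair per precision level eps
checks = [positive_part_bound(sample, eps) for eps in (0.5, 1.0, 2.0)]
```

Each pair in `checks` satisfies tail ≤ bound, since the inequality holds for any distribution, including the empirical one.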



5.2.4 Proof of Lemma 4.6

First, suppose that $q > 1$ and notice that

$$E Y_+^q = \int_0^\infty q\, y^{q-1} P(Y_+ \ge y)\,dy.$$

We thus deduce that because of the subgaussian inequality: 

EY+^ < qj,'^^^y^-'dy + qr^,cy''~^e~^dy. 



/4 ln( 

Then, with M a standard normal: 

Ey+« < {<j^lH^)^ + qcJ^^^^^y'^-'e-^dy 

This gives by Cauchy-Schwarz's inequality: 

Er+« < (aV'4d^^)9+qcy2^CT«(EA/'2(9-i)lo<jv-)3(P(y4h^(^< 
It leads to, since EAA^p = and^41n(c) > 1, 



Er+« < (av5lnR)« + (27r)i/V(^«(EAA2(9-i))5(e-^^)5 
We obtain, since V27rn(^)"eTis < n! < V27rn(|)"eT2s+T, 



\^2(<!-i)^27r(g-l)(i2^)'=(<!-i)e 12(9-1) 

< (tryilnp)'? + (27r)i/4g(T«(V2(?^^)('?-i)e2*(<'-i)+i"^3(i^) 

< (av/4h^)9 + (27r)i/Vj«^«(^^)^- 

\ - - - 

Thus, since (a + 6) « < ai + 6« , a, 6 > 0: 

(Er+'')i < ((av/4Mc))'' + (2^)i/4^2Ja9(^^)^)' 

which gives since < 35, (^^^)'^ < (^)'^ < (^)* since ^ > 1: 



(Ey+«)i < a 741n(^ + (7i((27r)i/42i)i^(2(g^y 

< C7^/4h^(^+3323C7(f )3. 



2, 



< CTA/41n(c) + (27r)^/''33 2ie-5c7yg 

< (a^/4h^(^+(27r)V43l2te-5c7)^ 



with 7 = (CTv/4lK(^+ (27r)i/43s 2 1 6-5(7)2 
For g = 1, notice that: 

(Er+«)^ < o-A/41n(c) + io- 
< \/^- 

□ 